Probabilistic projections of El Niño Southern Oscillation properties accounting for model dependence and skill

The El Niño – Southern Oscillation (ENSO) is a dominant mode of global climate variability. Nevertheless, future multi-model probabilistic projections of ENSO properties have not yet been made. Main roadblocks that have been hindering making these projections are climate model dependence and difficulty in quantifying historical model performance. Dependence is broadly defined as similarity between climate model output, assumptions, or physical parameterizations. Here, we propose a unifying metric of relative model performance, based on the probability density function (PDF) of ENSO paths. This metric is applied to assess the overall skill of Climate Model Intercomparison Project phase 6 (CMIP6) climate models at capturing ENSO. We then perform future multi-model probabilistic projections of changes in ENSO properties (from years 1850–1949 to 2040–2099) under the shared socioeconomic pathway scenario SSP585, accounting for model skill and dependence. We find that future ENSO will likely be more seasonally locked (89% chance), and have a longer period (67% chance). Yet, the jury is still out on future ENSO amplification. Our method reduces uncertainty by up to 37% compared to a simple approach ignoring model dependence and skill.

. Workflow of the multi-model projection method. t refers to monthly Niño 3 temperatures, and h refers to monthly equatorial mean thermocline depths. Blue rectangles represent information derived from CMIP6 models; red rectangles represent information derived from observations. Black rectangles show information derived from the fusion of both sources of information.  Table 1.

Results
We first estimate 50-year PDFs using historical output of 32 CMIP6 climate models as well as observations ( Table 1, "Methods") using a recently published method 35,37 . The method can account for non-Gaussianity, long-term dependence, and cross-correlations between the SSTs and thermocline depth anomalies. We choose a subset of 24 GCMs which can be appropriately approximated by the stochastic model (Table 1). We propose that a unifying measure of relative model performance with respect to observations can be a distance between these PDFs. We show model-observational distances based on symmetric Kullback-Leibler divergences in Fig. 2 (see Eqs. (4) and (5) in the "Methods" section). This method asserts that the best performing models are associated with the smallest distances. There is a large range in the model-observational distances (expressed as Kullback-Leibler divergencies), from about 1500 to more than 6000.
An important question is how the model-observational distances are related to model performance on various ENSO metrics, such as skewness, standard deviation, etc. Since there are many metrics 25 , even the best performing ENSO models can have issues at simulating some of them. On the other hand, worst ENSO models can perform well at certain metrics. The relationship between model-observational distance and a single metric is therefore expected to be weak. Furthermore, it is not immediately clear how ENSO performance at a range of metrics can be combined into a single score, that could be regressed with the distance. Thus, it is beyond the Table 1. List of CMIP6 climate models used in this study. Checkmarks indicate the climate models which are fitted by the stochastic model appropriately well. More detailed information on CMIP6 models, their components and parameterizations can be found at https:// search. es-doc. org/. www.nature.com/scientificreports/ scope of this paper to quantitatively relate model distance to model metrics. Instead, we qualitatively analyze fundamental ENSO simulation issues in the five closest and five furthest models. These issues are enumerated in Table 2, and figures supporting this table are provided in the Supplementary Material. We find that the models poorly performing on the distance measure tend to be associated with more fundamental issues in simulating ENSO. For example, only three serious issues are detected in the five closest models. At the same time, we find ten issues collectively in the five furthest models. Among the poorest performing models, one common issue is the lack of noise, as quantified here by standard deviation of either monthly Niño 3 SST or equatorial thermocline depth anomaly tendencies at a specific ENSO state (e.g. Niño 3 SST and thermocline depth anomaly). Note that these standard deviations are derived not from GCM/observed output directly, but from conditional PDFs of tendencies calculated in the fitted stochastic models. Another issue found in two models (INM-CM4.8 and INM-CM5.0) is lack of circular ENSO dynamics between Niño 3 SST and equatorial thermocline depth anomalies, at least for the analyzed months. Assuming these snapshots provide a good indicator of ENSO behavior for the remaining months, this indicates that the two variables are uncoupled from each other dynamically, in contradiction to recharge oscillator theory [38][39][40][41] and observations 37,42 . Other issues include wrong ENSO seasonality and absence of spectral peak in the thermocline anomalies. Note that some of these issues (wrong SST-thermocline dynamics) may not be reflected in simpler ENSO metrics, pointing to their drawbacks when evaluating ENSO performance.
Among the best-performing models, we found an issue with ENSO seasonality in MPI-ESM1-2-LR and too high SST noise forcing and lack of circular ENSO dynamics in the CAS-ESM2-0 model. This qualitative analysis suggests that the distance metric is related to ENSO performance. Therefore, this metric could be used as a single measure of model performance and as a replacement for a multitude of metrics used previously. This could streamline model development, validation, and inter-comparison efforts. However, a more quantitative analysis of relationships between the distance metric and other ENSO performance metrics should be carried out in the future.
Next, we validate the methodology for the PDF reconstruction. Since the PDFs are high-dimensional it is difficult to validate their estimation in the original space. Hence, we use a method known as multi-dimensional scaling (MDS) to convert the distances into model and observed positions in a low-dimensional space. The MDS is a dimension reduction method akin to principal component analysis (PCA). Unlike the PCA, the input for MDS is not modelled values, but inter-model distances 43 . This procedure allows for better visualization of the PDFs. The output for the MDS are positions of the models in a low dimensional space of principal coordinates R 1 , … R p . The MDS suggests that the uncertainties associated with estimation of the observed and modelled PDFs is reasonably low for the three leading MDS dimensions (Supplementary Note).
The multi-model projection method uses a low-dimensional representation of modelled positions 18 . To reconstruct those positions, the MDS was performed using the inter-model and model-observational distances from the first ensemble members ("r1i1p1f1") for the 24 models and observations. Five and six (e = 5, 6) dimensional representations are chosen as they yield the best performance (see "Methods"). To illustrate relationships between MDS coordinates and future changes, the response plots of 1850-1949 to 2040-2099 changes in six ENSO properties over the Niño 3 region (standard deviation, skewness, seasonality, ENSO spectrum peak, kurtosis and lag-1 autocorrelation) are shown in Fig. 3 as a function of the first two principal MDS coordinates, R 1 and R 2 . We note that these coordinates have no physical meaning, although they could be in theory correlated with certain physical quantities. There is a degree of randomness in the CMIP6 model responses due to internal variability and due to the uncertainty in the estimation of the MDS coordinates. However, some marked clusters are visible. They include the cluster of intensified, more persistent ENSO with lower kurtosis and shorter spectral peak in the middle-left of the configuration space (Fig. 3a,d,e,f), and decreased skewness at high R 1 values vs. increased skewness as low R 1 values (Fig. 3b). www.nature.com/scientificreports/ We then apply the recently developed multi-model projection method that accounts for model skills and dependence 18 . The key idea of the method is relaxing the assumption commonly used in the popular multi-model projection method called "Bayesian model averaging" 23,44 . That method constructs future projections as the weighted average of modelled PDFs. The weight is usually associated with model skill at representing presentday climate. This assumption is that in reality only one model can be correct, which does not properly represent model dependence. We relax this assumption. However, this leads to more complex mathematical expressions that can no longer be easily evaluated. Nonetheless, if we formulate model correctness hypotheses as regions www.nature.com/scientificreports/ in parameter space we can obtain future PDFs by instead sampling from the subspace of the parameter space where at least one model is correct 18 . The method, modified from its original version to fit the problem at hand, states that associated with each future change Δ (value in the future period minus the value in the historical period) there is a correctness indicator for each climate model i, depending on its relative position in space of R 1 , …, R e , and future change ( r 1,m i , . . . , r e,m i , � i ) , and the position of "truth" consisting of observations and given future change (r 1,o ,…, r e,o , Δ). The tolerance controlling the model correctness cutoff is controlled by a parameter k (see "Methods"). The method works by sampling all pairs of (k, Δ) for which at least one model is considered correct (see "Methods"). The prior probability of k is determined using cross-validation ("Methods"). It has been mathematically proven that this methodology considers both model skill at capturing present-day observations, and all orders of model dependence (e.g. dependence of each model on other models individually, on pairs of other models, triplets of other models, and so on) both in terms of present-day output and future projections 18 . We address the uncertainty in the correct MDS dimensionality by simply combining the projections from simulations using e = 5 and e = 6. We note that model dependence and model non-exclusivity are related terms. Here, we mathematically define dependence in terms of model non-exclusivity as = 1≤i<j≤l p(M i ∩ M j ) where p(M i ) is the probability that model i is correct, and l is the number of available models. A measure of non-exclusivity (or dependence) for a particular model could be similarly defined as i = 1<j<l,j� =i p(M i ∩ M j ) . In other words, it is a tendency for a model to be jointly correct with other models, and for its output to cluster together with other models.
We show the future projections for key ENSO properties in Fig. 4, while the summary of the results is presented in Table 3. The projection results are compared with simple poor-man's resampling of PDFs construed from 24 available future modelled changes using kernel density estimation (KDE; see "Methods"). We note that the KDE-based poor man's approach is statistically wrong, as it incorrectly assumes that modelled changes are independent samples from the underlying PDF. In addition, it does not account for differential model skill at simulating ENSO. Thus, the poor man's projections are shown for mere comparison with the novel method.
Our work projects a 64% chance of ENSO amplification (about as likely as not in the language of the IPCC uncertainty guidance note 45 ), with the mean strengthening of just 0.058 K (Fig. 4a, Table 3). The likelihood of an increase is down from 73% (likely) obtained from a simple resampling of the KDE-based PDF. Earlier studies based on Climate Model Intercomparison Project phase 5 (CMIP5) yielded inconclusive results: while one work projected future ENSO amplification 29 , other studies did not agree on the direction of future ENSO change 8,15,30 . Furthermore, a centennial time-scale modelling study argued that any possible intensification under increased atmospheric CO 2 concentrations would be temporary, with weaker ENSO on the long term 31 .
Yet, the newest CMIP6 climate model ensemble shows robust increases in ENSO amplitude 34 , with 88.4% percent of models predicting an increase in the twenty-first century over the twentieth century under the SSP585 forcing scenario. The resampled KDE-based PDF from our work shows a smaller 73% probability of an increase, and the difference between the studies could be due to a different region (we use Niño 3 compared to Niño 3.4 used in that work), and period. However, when model skill and dependence are considered, the percentage of future increases falls even more down to 64%. Indeed, out of three most likely models based on our analysis, one (CAM-ESM2-0) projects future ENSO weakening of -0.044 K.
Large uncertainty remains regarding future skewness projections (49% probability of an increase; Table 3, Fig. 4b). A similar percentage (47%) is obtained using the poor man's PDF resampling (Table 3). Previous work using CMIP5 model generation suggested that future skewness may decrease 15,30 . The gap between our and previous works could be due to a number factors, including using different sets of global climate models, periods for calculating the changes, and regions for averaging ENSO anomalies.
Our approach projects a robust increase in ENSO seasonality (89% chance), with a mean increase of 0.3 (Fig. 4c, Table 3). We note that observed seasonality (defined here as the ratio of ENSO standard deviation for the most variable month over one for the least variable month) for years 1915-2016 is 1.9. Therefore, the projected mean increase is 16%. These results are markedly different from the KDE-based "poor-man's" projections, which indicate only a 64% chance of an increase. Three most likely models under our analysis, MPI-ESM1-2-LR, CIESM and CAS-ESM2-0, project increases of 0.23, 0.33 and 0.47, respectively. These results are in contrast to inconclusive findings previously reported based on a set of CMIP5 climate models 15 .
Furthermore, our work indicates a likely (67% probability) lengthening of ENSO period. This is in stark contrast to a 37% chance of lengthening found in the KDE-based projection PDF. Furthermore, both lower and upper endpoints of the 90% confidence intervals are shifted towards higher values, with a small (7.7%) decrease in uncertainty (Fig. 4d, Table 3). The most likely outcome (mode of the PDF) is that of 1.3 years lengthening of ENSO period, which mirrors the value output from the highest-probability model, MPI-ESM1-2-LR. However, the model with the second highest probability, CIESM, showcases a small decrease of -0.55 years, which may explain a bump in the projection PDF around that value (Fig. 4d). ENSO period has not been previously intensely scrutinized, and available work did not project any large future changes in the ENSO spectrum 46 .
Furthermore, our projections offer mixed results on future kurtosis changes (58% chance of an increase). These results are markedly different from the likely (68% chance) decreases found using the poor man's approach of resampling the KDE-based PDF. For this variable, there is only a minor reduction (4.5%) of uncertainty owing to considering model skill and dependence. There is additionally a 61% chance of increased month-on-month autocorrelation, a small drop down from 67% (likely) under a poor-man's approach. Again, there is only a small (4.5%) decrease in projection uncertainty compared to the poor-man's approach in case of autocorrelation.
Compared to the poor-man's approach of resampling PDFs of future changes obtained through the KDE method, there is a reduction in uncertainty for all projection variables. The smallest reduction (4.5%) is found for kurtosis and lag-1 autocorrelation, while the largest reduction (37%) occurs for seasonality projections. The reduced projection uncertainty is thus a major benefit of our approach. We remind the reader that the poor-man's www.nature.com/scientificreports/ www.nature.com/scientificreports/ approach should never be used to make probabilistic projections; we include it solely with the purpose of showcasing the reduction of uncertainty owing to the novel projection method. An advantage of our method compared to deterministic approaches is that it allows us to characterize the range of uncertainty using various metrics, for example, the 90% confidence interval. Reflecting the overall future projected increase in seasonality, the associated 90% confidence interval is asymmetric around zero, ranging from − 0.089 to 0.70 (Fig. 4c). A spectrum peak shortening or lengthening by two years cannot be ruled out (Fig. 4d). Large decreases in kurtosis are more likely than increases of similar magnitude due to a large lower tail (Fig. 4e). In addition, we note that the model range is not always a reliable indicator of the projection uncertainty range. For example, there are incidences where at least one of the bounds of the 90% confidence interval is outside of the model range (Fig. 4). This illustrates the limits of using model range as an indicator of projection uncertainty.
Furthermore, we find that properly accounting for model dependence is numerically important. An existing popular method of combining models known as "Bayesian model averaging (BMA)" handles dependence using the following assumption: only one model can be true at a time ("mutually exclusive" events). In other words, if one model is correct, then all other models must be incorrect: p M i |M j = 0 , i � = j . If this is the case, then the sum of model probabilities is exactly one within our sample space (which asserts that at least one climate model is correct): l i=1 p(M i ) = 1 , where l is the number of models. More generally, the Law of Total Probability states that: with the first term l i=1 p(M i ) being greater than one. Previous work hypothesized that the assumption of model exclusivity in the BMA (e.g. ignoring probabilities of model pairs 1≤i<j≤l p M i ∩ M j , triplets 1≤i<j<k≤l p M i ∩ M j ∩ M k , etc.) may be relatively unimportant when the dimensionality of observational constrains on climate models is high (> 2). However, here we find that even with five MDS dimensions for case of standard deviation projections, l i=1 p(M i ) = 2.11 � = 1 . This suggests that accounting for model dependence in a proper way is critical even when multiple dimensions are used to constrain climate models.
Our work is subject to several caveats. First, we only consider time series aspects of ENSO without any spatial information and constrain our analysis to the Niño 3 region. We choose this region as it is important for dynamical ENSO evolution within the recharge oscillator theory [39][40][41]47 . However, previous work has found that the ENSO projections differ when they are made at the locations of model-dependent centers of ENSO variability 29 . Considering other ENSO regions is subject of future work. Second, we neglect the uncertainty in the present-day MDS location of observations. However, sensitivity analysis to the observations (see "Methods" and Supplementary Note) suggests that the errors in the observational estimate in three leading MDS dimensions are small. In addition, MDS location of pseudo-observations and the observations are similar, suggesting that the errors introduced due to fitting the stochastic model are small. Third, we assume that tolerance for defining model credibility regions is the same for all models. A more thorough treatment of model uncertainties should be done in the follow-up work. Fourth, we assume that ENSO over the periods used for the analysis is stationary, which has been suggested by a previous study 48 . Fifth, there is a discrepancy in the observational and the modelled periods: the GCMs output is used over the period 1871-2014, while the observations cover the period from 1915 to 2016. The length of the observational period is limited by the presence of a dense observational network; we found that extending the period back in time leads to the rapid deterioration of the stochastic model fits. At the same time, discarding earlier GCM data is expected to increase the errors in the estimation of GCM positions in the configuration space. However, our analysis suggests that the length of model time series is sufficient, and there are no indications internal variability is an issue. This is supported by proximity of different ensemble members of same GCMs in leading MDS space dimensions (Supplementary Note).
(1) Table 3. Summary of ENSO multi-model projections. Key properties of multi-model probabilistic ENSO projections for changes between years 1850-1949 and 2040-2099 from CMIP6 climate models under the SSP585 forcing scenario using the novel method and compared to the poor man's approach of resampling of PDFs obtained using kernel density estimation. Probability of increase is re-stated using the IPCC uncertainty guidance note terminology 45 . Likely changes (using the IPCC guidance note terminology) are shown in bold.

Variable
Projection mean Probability of an increase (novel method)

Conclusions
Here we for the first time present multi-model probabilistic projections of ENSO properties. The projections account for the differential skill of climate models at capturing present-day high-dimensional PDF of ENSO paths (this PDF completely describes the underlying stochastic process assuming it is limited by 50 years), and for the dependence of climate models. We find that future ENSO will likely display more seasonality (89% probability) and have a longer spectrum peak (67% probability), with the most likely spectrum peak lengthening of 1.3 years. We compare our projections with a poor-man's approach or sampling the probability density of future changes obtained through kernel density estimation. We stress that such poor man's approach should not be performed in practice as it does not account for model skill and dependence, and we include it here only for comparison purposes. We find that the novel approach reduces the projection uncertainty (measured by the width of 90% confidence intervals) by 4.5-37%, depending on the projection variable, with the largest reduction for ENSO seasonality. In addition, projections in some cases change qualitatively compared to the poor man's approach (based on the IPCC terminology). For example, kurtosis increases change from "unlikely" to "about as likely as not", and lag-1 autocorrelation increases change from "likely" to "about as likely as not". Furthermore, we attest that accounting for model dependence is important numerically, even when observational constraints include several dimensions. In addition, we propose a unifying PDF-based metric of climate model performance at ENSO time series properties and rank the ENSO performance of CMIP6 models. Future work should extend this method to consider the spatial structure of ENSO. Our method is not restricted to ENSO or atmospheric sciences and can be used with other monthly cyclostationary time series.

Methods
Data. We use 32 climate models from the CMIP6 project. The list of models is given in Table 1. These models were chosen as only they have bug-free output necessary to perform this analysis. Historical run ensemble members r1i1p1f1 for years 1871-2014 are used for fitting the stochastic models, while historical output for years 1850-1949 and future output under the SSP585 forcing for years 2040-2099 is used to calculate future changes. In addition, ensemble members r2i1p1f1 for three GCMs (BCC-CSM2-MR, CAMS-CSM1-0 and INM-CM5-0) are also utilized for years 1871-2014. All relevant monthly output is horizontally interpolated onto a 1°x1° grid using nearest neighbor interpolation. To calculate Niño 3 index t, we first perform the spatial average of SST over the Niño 3 region (5°S-5°N, 150°W-90°W), then calculate anomalies from monthly means, and then linearly detrend these anomalies over each respective period.
A similar procedure is used for calculating equatorial mean thermocline h over the Pacific for the latitudes 5°S-5°N. First, for each interpolated grid point the depth of the 17-degree isotherm is obtained from monthly potential temperature fields. Preliminary data analysis indicates that the difference between potential temperature and temperature can be ignored for the thermocline depth calculations. Then, the depth is horizontally averaged over the equatorial Pacific, the annual cycle is removed, and the anomalies are linearly detrended.
The data extraction procedure for observations (years 1915-2016) is similar, except no interpolation of the observed datasets is done. Observed SST anomalies originate from the ERSSTv5 dataset 49  We also perform sensitivity analysis with respect to the observations. To do that, we construct the second observational dataset, where SST anomalies originate from the HadISST dataset 51 , and the sources for ocean temperature used to calculate the thermocline depth anomalies are split between two datasets. Specifically, they come from SODAv2.2.4 for years 1915-1957 and from ORAS4 for years 1958-2016 52 . Note that to our knowledge SODAv2.2.4 is the only available ocean reanalysis available prior to year 1958. Stochastic models. Stochastic models are fitted to the 1871-2014 monthly Niño 3 index and thermocline depth following previous work 35,37 . Specifically, the conditional probabilities of monthly Niño 3 changes ( t ) and thermocline depth changes ( h ) given the state (t,h) are found from the joint probabilities p(�t, t, h) and p(�h, t, h) , which are, in turn, estimated using kernel density methods separately for each calendar month 37 . The non-Markov part of the model describes the normal transforms of the SST and thermocline depth changes using the multivariate normal distribution. Specifically, where subscript t stands for temperature, superscript indicates time index, −1 is the inverse standard normal cumulative distribution function (CDF), F is conditional cumulative distribution function (CCDF), and m is month. In Eq. (2) the changes are defined as �t (s) = t (s+1) − t (s) and similarly for the thermocline depth anomalies. A similar equation is used for normal transforms of thermocline depth changes ξ (j) h . The transforms are then collected into a single vector which is modelled using the multivariate normal distribution: N(0, �) . The covariance matrix of this distribution is estimated from data. The model allows to find the theoretical cyclostationary probability of any sequence of ENSO trajectories z: www.nature.com/scientificreports/ The stochastic models are fitted using a procedure that has been modified from previous work 35,37 . Consistent with these studies, we use bandwidth SK during the kernel density estimation, where S is a numeric variable and K is the two-stage plug-in estimator for the bandwidth matrix 53 . However, the procedure for estimating S is different. Here, we develop a systematic procedure for choosing S. First, for each GCM, different values of S are tried, starting from an initial value of 2, and decreasing in steps of 0.25. For each value, the Markov part of the model is first fitted on 50×50×50 grids of ( �t, t, h) and ( �h, t, h) . For the non-Markov part, we set the size of to cover the period of 143 years (1 year less than the length of the model output), and optimize following previous work 35 . For each S, we keep track of the AIC of the model output over years 1871-2013. If the AIC at the next value of S in the sequence starts increasing, or if a multimodality is detected in any of the marginal PDFs p(�t) , p(t) or p(h) obtained from the joint PDF p(�t, t, h) for the representative month of January, we keep the current value of S as the final value. Afterwards, we obtain the multi-model mean of S determined in this way across all GCMs (0.98) and re-fit the stochastic models for all 32 GCMs using this value. We have tried using different S values for each GCM, however, it led to inconsistent PDF estimation based on the MDS analysis. Therefore, we opt to use the same value for all GCMs. This value of S is also used for fitting the ensemble member r2i1p1f1 output for three GCMs.
We assess S separately for the observations in a similar way over the period 1915-2016. In this case, covers 101 years, and AIC is calculated for years 1915-2015. We estimate the observed S to be 1.5. It is natural to hypothesize that S is different between the GCMs and the observations because of the different length of the data. Specifically, when the sample size is smaller, we need to make sure that there is enough smoothing to remove spurious multimodalities and discontinuities in the PDFs, hence we expect a larger S.
The goodness of fit of the stochastic models is estimated using their relative error at representing the original GCM output and comparing this error to the relative error of long runs of the stochastic model fitted to EC-Earth3-Veg GCM at representing its own 144-year time segment. EC-Earth3-Veg is chosen due to its positive performance on a number of ENSO-related metrics 25 . The procedure is described in previous work 35 , except the length of the short segment to which the long runs are compared to is different. The stochastic model is deemed to fit reasonably well to 24 out of 32 models (Table 1). We thus choose these models for further analysis.

Inter-model and model-observational distances. Once the model and observed monthly ENSO path
PDFs p(z) for 50-year periods are obtained, we set out to calculate distances between 600-dimensional PDFs for all GCMs, and observations. Here we use the symmetric version of the Kullback-Leibler divergence: where is the standard Kullback-Leibler divergence. Here p M i (z) and q M j (z) represent PDFs of vector z being simulated by the ith and jth models, respectively. Note that this is a multidimensional integral over 50-year trajectories of Niño 3 SST and equatorial thermocline depth. Symmetric Kullback-Leibler divergences have been previously used in the context of dimensionality reduction of PDFs 54 . This integral is evaluated using Monte Carlo integration. Specifically, for each pair of GCMs or observations samples from p M i (z) are generated using direct simulation of the stochastic model, and then D k p M i (z)||q M j (z) is obtained by taking a simple mean of the fraction term ln p M i (z) q M j (z) over the random z samples, and D k q M j (z)||p M i (z) is calculated in the similar way. We use 1000 Monte Carlo samples. Preliminary data analysis shows that this number of samples is sufficient to estimate the symmetric Kullback-Leibler divergence with reasonable accuracy.

Multidimensional scaling (MDS).
Multidimensional scaling takes in inter-model (or model-observational) distances, and finds model locations in the space of given dimensionality 55,56 . More specifically, the input to the method is the n × n matrix of inter-model distances Ŵ = (γ ij ) . These distances are first converted to the n × n matrix A = (a ij ) , where a ij = − 1 2 γ 2 ij . Next, the symmetric n × n matrix B = HAH is formed, where H = I n − n −1 J n and J n is an n × n matrix of ones. The matrix B is then spectrally decomposed to find eigenvalues and eigenvectors: Here, we select the MDS dimension e using cross-validation by excluding each model one at-a-time, and comparing multi-model projections (see "Cross-Validation" subsection) to the excluded model outputs. We achieve the best mean absolute error of the mean for e = 5, and best mean absolute error of the mode for e = 6. We www.nature.com/scientificreports/ therefore combine results for these dimensionalities when performing future projections (subsections below). We perform MDS twice: first with observations, canonical 24 models, additional runs of three GCMs, pseudoobservations, and second set of alternative observations (Supplementary Note). The purpose of this analysis is to illustrate the uncertainties involves in the estimation of the MDS coordinates and the original PDFs. The second time we perform the MDS is for the purpose of multi-metric future projections. Here, we use only the observations and the canonical r1i1p1f1 runs of the 24 GCMs. When performing the MDS the first row of the distance matrix always involves the default observational set, while the rest of the rows involve climate models, pseudo-observations or additional observations. Future multi-model projections. We define hypotheses about climate model correctness by modifying previous work 18 as regions in parameter space. Consider that each ith model is associated with the coordinate θ i = r 1,m i , . . . , r e,m i , � m i in the space of e MDS coordinates and future changes . Note that here changes represent long-term changes between the historical (years 1850-1949) and the future period (years 2040-2099). These coordinates can be normalized to have the inter-model range of one in each dimension; let the normalized coordinates be: where G j is the range of model coordinates in the dimension j, and G is the range of future modelled changes. Each model i is considered to be correct ( H i ) under a given "true" future climate change if the distance between its normalized coordinates θ * i and the normalized coordinates of the truth (consisting of the normalized MDS p osit ions of obs er vat ions and t he nor ma lize d "t r ue" f uture climate change � * ) θ * = r * 1,o , . . . , r * e,o , � * = r 1,o G 1 , . . . , r e,o G e , � G � is within a given value k: The difference between this formulation and the one previously employed 18 is that the current formulation uses hyperspheres in an e-dimensional space as model correctness regions, whereas the previous formulation uses hypercubes. The hypersphere formulation appears to be more intuitive to us, since it, unlike the hypercube formulation, could be represented using distances. In addition, its region boundaries are smooth, unlike those of the hypercube formulation.
Furthermore, we define sample space as the region in space ( �, k) where at least one climate model is correct: H 1 ∪ · · · ∪H l , where l = n-1 is the number of models. We choose the prior for k to be a half-normal distribution k ∼ N + (0, σ 2 ) following prior work 18 . Here, σ is the parameter chosen by the user. We find an appropriate value for it using cross-validation (described below). Future projections are made using the marginalization theorem: where 1 is an indicator function for membership in the set (it is one over the set , and 0 outside of this set). Samples from 1 � p(k) are taken using rejection sampling. We perform future projections using runs r1i1p1f1 from the 24 selected models.
We address the uncertainty in the number of dimensions e by combining projections from e = 5 and 6, as these were the best-performing dimensionalities during cross-validation (see subsection below).

Cross-validation.
We use cross-validation to optimize the number of MDS dimensions, as well as the parameter σ . We try MDS dimensionalities e ranging from 2 to 8. For each dimensionality, each model is excluded from the analysis one-by-one (except the model CAS-ESM2-0, which is an outlier in the MDS space), multi-model projections of Niño 3 standard deviation changes for years 1850-1949 to 2040-2099 are performed using the remaining 23 models and compared with the modelled change in the excluded model. Then, an appropriate σ is obtained such that it provides well-calibrated confidence intervals for the standard deviation changes (the 90% confidence intervals having an empirical coverage of around 90%). 50,000 samples from the joint PDF are used. An additional experiment in 5 MDS dimensions with a different initial random seed indicates that 50,000 samples are enough to reasonably approximate the marginal PDF of future projections. We analyze the mean absolute error of the mean and mean absolute error of the mode for different dimensions. We find that the mean absolute error of the mean is smallest for e = 5, while the mean absolute error of the mode -for e = 6. Without any further information on the likelihood of different e values, we simply combine projections from both dimensions (which is effectively Bayesian model averaging with an equal probability for two competing models with different values of the hyperparameter e).
A sensitivity cross-validation experiment for e = 5 where all MDS dimensions have been equally scaled using the mean range G ′ = 1 5 5 j=1 G j has resulted in much wider mean 90% confidence interval compared to the baseline method. We therefore choose individual scaling factors G j for each MDS coordinate.
We also use cross-validation to determine the proper smoothing for the binned kernel density estimate for the PDFs in the case of the poor-man's approach. Specifically, the bandwidth in the kernel density method is S × d , where S is the smoothing parameter, and d is the two-stage direct plug-in bandwidth estimate 57 . We select S for which 90% of the time the "true" changes are within the 90% confidence intervals of the kernel-smoothed (9) p(�) = p(�, k)dk = p(�|k)p(k)dk ∝ 1 � p(k)dk,